Combine tibbles
*_join(...) is a family of functions for combining two tibbles on common variable(s) called key.
For example, take a group of people (n=7) for which we have collected age and height data:
# height
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
height = c(160, 172, 182, 157, 162))
df1
# A tibble: 5 x 2
name height
<chr> <dbl>
1 Isa 160
2 Jaylinn 172
3 Mila 182
4 Milas 157
5 Yara 162
# age
df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara"),
age = c(20,24,17,23,17))
df2
# A tibble: 5 x 2
name age
<chr> <dbl>
1 Fiene 20
2 Jaylinn 24
3 Mila 17
4 Noah 23
5 Yara 17
The information here is incomplete, for some we have only height and for some only age. Now we want to combine these tibbles into a single table with variables name, age and height. We can do this with a join function which is able to combine observations (rows) by matching on the common variable(s) between the two tibbles. Matching is done by one or more variables, in this case the variable name.
Keep in mind that for a proper joining of data, the matched variable(s), also known as the key, must be unique in both tibbles.
You may control the layout of the combined table by choosing one of the three join function laid out below.
Return all rows in df1 where there are matching values of name in df2 and all columns in both df1 and df2:
inner_join(df1,df2, by = "name")
# A tibble: 3 x 3
name height age
<chr> <dbl> <dbl>
1 Jaylinn 172 24
2 Mila 182 17
3 Yara 162 17
Return all rows from df1 and all columns in both df1 and df2, NA for missing values in df2:
left_join(df1,df2, by = "name")
# A tibble: 5 x 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
All rows and all columns in df1 and df2.
full_join(df1,df2, by = "name")
# A tibble: 7 x 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
6 Fiene NA 20
7 Noah NA 23
If the keys are not unique in either of the tibbles then the result will be ambiguous. Take for example df1 defined above, we now add another observation with name Yara and height 168:
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara", "Yara"),
height = c(160, 172, 182, 157, 162, 168))
df1
# A tibble: 6 x 2
name height
<chr> <dbl>
1 Isa 160
2 Jaylinn 172
3 Mila 182
4 Milas 157
5 Yara 162
6 Yara 168
This will make the key, variable name, non-unique. Notice the result of a full join:
full_join(df1, df2, "name")
# A tibble: 8 x 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
6 Yara 168 17
7 Fiene NA 20
8 Noah NA 23
As you can see the join found two matches for Yara in df1, so both are taken and the age for Yara in df2 is replicated.
Use only unique key in joins to avoid ambiguous results.
What if we update df2 also with name Yara with age 19 and apply a join?
df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara", "Yara"),
age = c(20,24,17,23,17,19))
df2
# A tibble: 6 x 2
name age
<chr> <dbl>
1 Fiene 20
2 Jaylinn 24
3 Mila 17
4 Noah 23
5 Yara 17
6 Yara 19
full_join(df1,df2,"name")
# A tibble: 10 x 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
6 Yara 162 19
7 Yara 168 17
8 Yara 168 19
9 Fiene NA 20
10 Noah NA 23
As you can see we have now 4 ambiguous observations for Yara
Take df1 as defined above and df3:
# height (cm)
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
height = c(160, 172, 182, 157, 162))
# height (feet)
df3 <- df1 %>% mutate(height=height/30.48)
df3
# A tibble: 5 x 2
name height
<chr> <dbl>
1 Isa 5.25
2 Jaylinn 5.64
3 Mila 5.97
4 Milas 5.15
5 Yara 5.31
both of these tibbles have the variable height, after joining by name :
left_join(df1,df3,by="name")
# A tibble: 5 x 3
name height.x height.y
<chr> <dbl> <dbl>
1 Isa 160 5.25
2 Jaylinn 172 5.64
3 Mila 182 5.97
4 Milas 157 5.15
5 Yara 162 5.31
we see that left_join distinguishes height from df1 and the height from df3 with suffixes .x and .y respectively.
There are several more join functions not covered here such as right_join (opposite of left_join), semi_join and anti_join (see dplyr cheat sheet).
The base function duplicated can be used to find duplicates in a vector or a tibble.
v <- c("Isa","Jaylinn","Mila","Bonnie", "Yara", "Mila", "Isa", "Mila")
v
[1] "Isa" "Jaylinn" "Mila" "Bonnie" "Yara" "Mila" "Isa" "Mila"
Here Isa and Mila occur two and three times respectively. The function duplicated returns a logical with TRUE at corresponding positions whenever it matches a value that it had already seen from position 1 to n, n being the length of the vector in this case.
duplicated(v)
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
duplicated(…) function does not mark the first occurence of the duplicated values with TRUE.
On tibbles the function duplicated can handle multiple variables(columns):
# the two observations on Yara are identical
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara", "Yara"),
height = c(160, 172, 182, 157, 162, 162))
df1
# A tibble: 6 x 2
name height
<chr> <dbl>
1 Isa 160
2 Jaylinn 172
3 Mila 182
4 Milas 157
5 Yara 162
6 Yara 162
duplicated(df1)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
We can use this logical vector to filter observations from df1:
df1 %>% filter(duplicated(df1))
# A tibble: 1 x 2
name height
<chr> <dbl>
1 Yara 162
Note that we only get 1 row as the result and not 2 because of the behaviour of duplicated function which does not mark the first match.
We can now test whether certain combinations of variables can uniquely identify each observation in the dataset. Let us take the variables {name,age} in the pulse dataset:
key1 <- pulse %>% select(name,age)
key1
# A tibble: 110 x 2
name age
<chr> <dbl>
1 Bonnie 18
2 Melanie 19
3 Consuelo 18
4 Travis 18
5 Lauri 18
6 George 22
7 Cherry 20
8 Francesca 18
9 Sonja 19
10 Troy 23
# … with 100 more rows
If the duplicated function returns FALSE for all positions in key1 tibble then it means it did not find any duplicates.
# sum over a logical vector counts the number of TRUE values
sum(duplicated(key1))
[1] 2
the sum here is non-zero therefore there are duplicated values, i.e. the key set key1 is not unique.
To illustrate let’s split the pulse dataset into two separate tibbles with common variables name and height:
p1 <- pulse %>% select(name,height,age,gender,year)
p1
# A tibble: 110 x 5
name height age gender year
<chr> <dbl> <dbl> <chr> <dbl>
1 Bonnie 173 18 female 1993
2 Melanie 179 19 female 1993
3 Consuelo 167 18 female 1993
4 Travis 195 18 male 1993
5 Lauri 173 18 female 1993
6 George 184 22 male 1993
7 Cherry 162 20 female 1993
8 Francesca 169 18 female 1993
9 Sonja 164 19 female 1993
10 Troy 168 23 male 1993
# … with 100 more rows
p2 <- pulse %>% select(-age,-gender,-year)
p2
# A tibble: 110 x 10
id name height weight smokes alcohol exercise ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_A Bonnie 173 57 no yes moderate sat 86 88
2 1993_B Melanie 179 58 no yes moderate ran 82 150
3 1993_C Consuelo 167 62 no yes high ran 96 176
4 1993_D Travis 195 84 no yes high sat 71 73
5 1993_E Lauri 173 64 no yes low sat 90 88
6 1993_F George 184 74 no yes low ran 78 141
7 1993_G Cherry 162 57 no yes moderate sat 68 72
8 1993_H Francesca 169 55 no yes moderate sat 71 77
9 1993_I Sonja 164 56 no yes high sat 68 68
10 1993_J Troy 168 60 no yes moderate ran 88 150
# … with 100 more rows
p1 has the common variables {name,height} (the key) and additional variables {age,gender,year}, and p2 has the same key variables {name,height} with some other measurement variables such as pulse1 pulse2 etc.
AnswerIs the combination
{name,height}a valid choice as the unique key to join two tibbles p1 and p2?
Now we want to join them back together using full_join:
full_join(p1,p2) # common variables are used when no variable is specified.
Joining, by = c("name", "height")
# A tibble: 110 x 13
name height age gender year id weight smokes alcohol exercise ran pulse1 pulse2
<chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Bonnie 173 18 female 1993 1993_A 57 no yes moderate sat 86 88
2 Melanie 179 19 female 1993 1993_B 58 no yes moderate ran 82 150
3 Consuelo 167 18 female 1993 1993_C 62 no yes high ran 96 176
4 Travis 195 18 male 1993 1993_D 84 no yes high sat 71 73
5 Lauri 173 18 female 1993 1993_E 64 no yes low sat 90 88
6 George 184 22 male 1993 1993_F 74 no yes low ran 78 141
7 Cherry 162 20 female 1993 1993_G 57 no yes moderate sat 68 72
8 Francesca 169 18 female 1993 1993_H 55 no yes moderate sat 71 77
9 Sonja 164 19 female 1993 1993_I 56 no yes high sat 68 68
10 Troy 168 23 male 1993 1993_J 60 no yes moderate ran 88 150
# … with 100 more rows
It is a better practice to explicitly specify the variables:
pulse_joined <- full_join(p1,p2, by = c("name","height"))
pulse_joined
# A tibble: 110 x 13
name height age gender year id weight smokes alcohol exercise ran pulse1 pulse2
<chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Bonnie 173 18 female 1993 1993_A 57 no yes moderate sat 86 88
2 Melanie 179 19 female 1993 1993_B 58 no yes moderate ran 82 150
3 Consuelo 167 18 female 1993 1993_C 62 no yes high ran 96 176
4 Travis 195 18 male 1993 1993_D 84 no yes high sat 71 73
5 Lauri 173 18 female 1993 1993_E 64 no yes low sat 90 88
6 George 184 22 male 1993 1993_F 74 no yes low ran 78 141
7 Cherry 162 20 female 1993 1993_G 57 no yes moderate sat 68 72
8 Francesca 169 18 female 1993 1993_H 55 no yes moderate sat 71 77
9 Sonja 164 19 female 1993 1993_I 56 no yes high sat 68 68
10 Troy 168 23 male 1993 1993_J 60 no yes moderate ran 88 150
# … with 100 more rows
You may check:
all_equal(pulse,pulse_joined) # caution: all_equal is an experimental function
[1] TRUE
⚠️ The function all_equal is an experimental function and may become obsolete!
Copyright © 2021 Biomedical Data Sciences (BDS) | LUMC